fix: MCP client auto reconnect when MCP session is not available#8682
fix: MCP client auto reconnect when MCP session is not available#8682SlightDust wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the automatic reconnection logic in mcp_client.py by replacing the tenacity retry decorator with a custom retry loop that handles specific transport and connection errors. However, several critical issues were identified in the review: a syntax error was introduced where self.session.call_tool was omitted, top-level imports of optional dependencies (McpError and anyio) will cause import failures if they are not installed, and catching BaseException instead of Exception risks intercepting system-level exceptions like KeyboardInterrupt.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| stop_after_attempt, | ||
| wait_exponential, | ||
| ) | ||
| from mcp.shared.exceptions import McpError |
There was a problem hiding this comment.
The import of McpError from mcp.shared.exceptions is performed at the top level. However, mcp is an optional dependency in this codebase (as indicated by the conditional import block on lines 91-98). If mcp is not installed, importing mcp_client.py will fail immediately with a ModuleNotFoundError at startup.
Please remove this top-level import and instead import McpError dynamically or handle its import failure gracefully.
| _TRANSPORT_ERRORS: tuple[type[Exception], ...] = ( | ||
| anyio.ClosedResourceError, | ||
| anyio.BrokenResourceError, | ||
| anyio.EndOfStream, | ||
| ConnectionError, | ||
| ConnectionResetError, | ||
| BrokenPipeError, | ||
| httpx.HTTPStatusError, | ||
| httpx.ConnectError, | ||
| httpx.ReadError, | ||
| httpx.RemoteProtocolError, | ||
| McpError, | ||
| ) |
There was a problem hiding this comment.
Defining _TRANSPORT_ERRORS with anyio.ClosedResourceError and McpError at the module level will raise a NameError or ModuleNotFoundError if anyio or mcp are not installed (since they are optional dependencies imported conditionally).
To prevent this, we should construct _TRANSPORT_ERRORS dynamically by safely attempting to import these optional exceptions.
_transport_errors_list: list[type[Exception]] = [
ConnectionError,
ConnectionResetError,
BrokenPipeError,
httpx.HTTPStatusError,
httpx.ConnectError,
httpx.ReadError,
httpx.RemoteProtocolError,
]
try:
import anyio
_transport_errors_list.extend([
anyio.ClosedResourceError,
anyio.BrokenResourceError,
anyio.EndOfStream,
])
except ImportError:
pass
try:
from mcp.shared.exceptions import McpError
_transport_errors_list.append(McpError)
except ImportError:
pass
_TRANSPORT_ERRORS = tuple(_transport_errors_list)| ) | ||
|
|
||
|
|
||
| def _is_connection_error(exc: BaseException) -> bool: |
| reraise=True, | ||
| ) | ||
| async def _call_with_retry(): | ||
| last_exc: BaseException | None = None |
| read_timeout_seconds=read_timeout_seconds, | ||
| ) | ||
| except anyio.ClosedResourceError: | ||
| except BaseException as e: |
There was a problem hiding this comment.
Catching BaseException is generally discouraged because it intercepts system-initiating exceptions like KeyboardInterrupt, SystemExit, and asyncio.CancelledError. This can lead to unexpected behavior or prevent clean shutdowns/cancellations.
Since all connection and transport-related exceptions inherit from Exception, we should catch Exception instead.
except Exception as e:There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In
call_tool_with_reconnect, the line that actually invokes the tool (self.session.call_tool(...)) appears to have been removed, soreturn await (will raise a syntax/runtime error; you likely wantreturn await self.session.call_tool(...)as before with the new retry logic around it. - The new manual retry loop removes the previous exponential backoff and detailed tenacity logging; if repeated reconnects are expected in some deployments, consider adding a small delay or backoff between attempts and ensuring failures are logged in a way that preserves the previous observability.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `call_tool_with_reconnect`, the line that actually invokes the tool (`self.session.call_tool(...)`) appears to have been removed, so `return await (` will raise a syntax/runtime error; you likely want `return await self.session.call_tool(...)` as before with the new retry logic around it.
- The new manual retry loop removes the previous exponential backoff and detailed tenacity logging; if repeated reconnects are expected in some deployments, consider adding a small delay or backoff between attempts and ensuring failures are logged in a way that preserves the previous observability.
## Individual Comments
### Comment 1
<location path="astrbot/core/agent/mcp_client.py" line_range="666-669" />
<code_context>
try:
- return await self.session.call_tool(
+ return await (
name=tool_name,
arguments=arguments,
read_timeout_seconds=read_timeout_seconds,
)
- except anyio.ClosedResourceError:
</code_context>
<issue_to_address>
**issue (bug_risk):** The call to `call_tool` is missing the target coroutine (`self.session.call_tool`), making this invalid and likely a runtime error.
Previously this block correctly called `self.session.call_tool(...)`, but now it just awaits a parenthesized list of keyword arguments, which is invalid syntax and will fail at runtime. Please restore the explicit call, e.g.
```python
return await self.session.call_tool(
name=tool_name,
arguments=arguments,
read_timeout_seconds=read_timeout_seconds,
)
```
</issue_to_address>
### Comment 2
<location path="astrbot/core/agent/mcp_client.py" line_range="671" />
<code_context>
read_timeout_seconds=read_timeout_seconds,
)
- except anyio.ClosedResourceError:
+ except BaseException as e:
+ last_exc = e
+
+ if not _is_connection_error(e):
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Catching `BaseException` may swallow cancellations and other critical signals; consider restricting to `Exception`.
This will also catch `asyncio.CancelledError`, `KeyboardInterrupt`, and `SystemExit`. In async code that can break task cancellation and shutdown, especially if this layer retries or suppresses the error. Limiting the handler to `Exception` and letting cancellations/termination signals propagate is safer:
```python
except Exception as e:
...
```
```suggestion
except Exception as e:
```
</issue_to_address>
### Comment 3
<location path="astrbot/core/agent/mcp_client.py" line_range="125-133" />
<code_context>
+)
+
+# Keywords in exception messages that indicate a transport/connection problem.
+_CONNECTION_ERROR_KEYWORDS = (
+ "session terminated",
+ "terminated",
+ "closed resource",
+ "broken resource",
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The generic keyword "terminated" may falsely classify unrelated errors as connection issues.
This broad substring match risks treating unrelated lifecycle or business errors as transport failures, triggering unnecessary reconnects. Consider restricting matching to more specific phrases (e.g., variants of the existing "session terminated") or otherwise tightening the condition so it only captures genuine connection/transport errors.
```suggestion
# Keywords in exception messages that indicate a transport/connection problem.
# Note: avoid overly generic substrings (e.g. just "terminated") to prevent
# misclassifying unrelated lifecycle or business errors as connection issues.
_CONNECTION_ERROR_KEYWORDS = (
"session terminated",
"session was terminated",
"closed resource",
"broken resource",
"connection closed",
"connection reset",
)
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| read_timeout_seconds=read_timeout_seconds, | ||
| ) | ||
| except anyio.ClosedResourceError: | ||
| except BaseException as e: |
There was a problem hiding this comment.
suggestion (bug_risk): Catching BaseException may swallow cancellations and other critical signals; consider restricting to Exception.
This will also catch asyncio.CancelledError, KeyboardInterrupt, and SystemExit. In async code that can break task cancellation and shutdown, especially if this layer retries or suppresses the error. Limiting the handler to Exception and letting cancellations/termination signals propagate is safer:
except Exception as e:
...| except BaseException as e: | |
| except Exception as e: |
| # Keywords in exception messages that indicate a transport/connection problem. | ||
| _CONNECTION_ERROR_KEYWORDS = ( | ||
| "session terminated", | ||
| "terminated", | ||
| "closed resource", | ||
| "broken resource", | ||
| "connection closed", | ||
| "connection reset", | ||
| ) |
There was a problem hiding this comment.
suggestion (bug_risk): The generic keyword "terminated" may falsely classify unrelated errors as connection issues.
This broad substring match risks treating unrelated lifecycle or business errors as transport failures, triggering unnecessary reconnects. Consider restricting matching to more specific phrases (e.g., variants of the existing "session terminated") or otherwise tightening the condition so it only captures genuine connection/transport errors.
| # Keywords in exception messages that indicate a transport/connection problem. | |
| _CONNECTION_ERROR_KEYWORDS = ( | |
| "session terminated", | |
| "terminated", | |
| "closed resource", | |
| "broken resource", | |
| "connection closed", | |
| "connection reset", | |
| ) | |
| # Keywords in exception messages that indicate a transport/connection problem. | |
| # Note: avoid overly generic substrings (e.g. just "terminated") to prevent | |
| # misclassifying unrelated lifecycle or business errors as connection issues. | |
| _CONNECTION_ERROR_KEYWORDS = ( | |
| "session terminated", | |
| "session was terminated", | |
| "closed resource", | |
| "broken resource", | |
| "connection closed", | |
| "connection reset", | |
| ) |
Modifications / 改动点
MCP服务端重启的话,Astrbot这边的调用会持续报404,所以让ai改了重试机制,在发现MCP不可用时尝试重连3次。
fix #8681
Screenshots or Test Results / 运行截图或测试结果
Checklist / 检查清单
😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
/ 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。
👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
/ 我的更改经过了良好的测试,并已在上方提供了“验证步骤”和“运行截图”。
🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in
requirements.txtandpyproject.toml./ 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到
requirements.txt和pyproject.toml文件相应位置。😮 My changes do not introduce malicious code.
/ 我的更改没有引入恶意代码。
Summary by Sourcery
Improve MCP client tool invocation resilience by broadening connection error detection and implementing a custom multi-attempt reconnect loop.
Bug Fixes: